This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Epidemic curves

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               aweek,        # working with dates
               lubridate,    # Manipulate dates    
               incidence,    # an option for epicurves of linelist data
               stringr,      # Search and manipulate character strings
               forcats,      # working with factors
               RColorBrewer) # Color palettes from colorbrewer2.org

Load data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated Ebola epidemic
  • Aggregated counts by hospital from the same simulated Ebola epidemic

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.

For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Review the two datasets and notice the differences

Case linelist

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Case counts aggregated by hospital

# display the linelist data as a table
DT::datatable(count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Set parameters

You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). In this case, we set this date as 27 July 2013.

Now we can reference the object data_date into the code and have it reference that date.

## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")

Verify dates

Verify that each date column is class Date and has an approporiate range of values. This for loop prints a histogram for each column.

# create character vector of column names 
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))

# Produce histogram of each date column
for (Col in DateCols) {     # open loop. iterate for each name in vector DateCols
  hist(linelist[, Col],     # print histogram of the column in linelist dataframe
       breaks = 50,         # number of breaks for the histogram
       xlab = Col)          # x-axis label is the name of the column
  }                         # close the loop

incidence package

Below are tabs on making quick epicurves using the incidence package

CAUTION: Epicontacts expects data to be in a “linelist” format of one row per case (not aggregated). If your data is aggregated counts, look to the ggplot epicurves tab.

TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.

https://cran.r-project.org/web/packages/incidence/vignettes/customize_plot.html#example-data-simulated-ebola-outbreak

Simple

The incidence package requires 2 steps to plot an epicurve:

  1. Create an incidence object (using the function incidence())
  2. Plot the incidence object

A simple example - an epicurve of daily cases:

# load package
library(incidence)

# create the incidence object using data by day
epi_day   <- incidence(linelist$date_onset, interval = "day")

# plot the incidence object
plot(epi_day)

#### Change aggregation of cases

The interval defines how the observations are grouped. Available options include all options in the package aweek, including but not limited to:

  • “Monday week”
  • “2 Monday weeks”
  • “Sunday week”
  • “MMWRweek” (starts on Sunday)
  • “Month”
  • “Quarter”
  • “Year”
# Weekly
epi_wk  <- incidence(linelist$date_onset, interval = "Monday week")

# MMWR week
epi_MMWRwk  <- incidence(linelist$date_onset, interval = "MMWRweek")

# Three weeks
epi_3wk  <- incidence(linelist$date_onset, interval = "3 weeks")

# Monthly
epi_month <- incidence(linelist$date_onset, interval = "month")

# Plot the incidence objects
plot(epi_wk)
plot(epi_MMWRwk)
plot(epi_3wk)
plot(epi_month)

The first date and last date of the plot can also be specified.

Modifications

The incidence package enables modifications in the following ways:

  • Arguments of plot()
  • scale_x_incidence()
  • ggplot() additions

Read details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.

Aesthetic themes and other ggplot() syntax can be added with the + symbol, because incidence is using ggplot() in the background.

Show individual cases

For a smaller outbreak, it may be important to show boxes around each individual case. Use the argument show_cases = TRUE in the plot() function.

Create the weekly incidence object for a smaller outbreak (only cases from Central Hospital)

# create filtered dataset for Central Hospital
central_data  <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "Monday week")

And now plot the epicurve:

# plot outbreak, showing each case
plot(central_outbreak,
     show_cases = T)                 # show boxes around each individual case

The same epicurve showing individual cases, but with many other modifications:


# add plot() arguments and ggplot() commands
plot(central_outbreak,
     show_cases = T,                 # show boxes around each individual case
     #labels_week = T,               # if TRUE: x-axis dates display as weeks ("YYYY-Www") 
     color = "lightblue",            # color inside boxes
     border = "darkblue",            # border color of boxes
     alpha = 0.5,                    # transparency
     xlab = "Date of onset",         # x-axis label
     ylab = "Weekly reported cases", # y-axis label
     n_breaks = 7)+                  # to override auto number of x-axis breaks
  
  # ggplot() commands added to the plot
  ggtitle("Weekly case incidence at Central Hospital")+ # add title separately
  scale_x_date(date_labels = "%d %b")+                  # adjust how dates are displayed
  scale_y_continuous(breaks = seq(0, 30, 5))+           # adjust y-axis intervals
  theme_minimal()+                                      # simplify background
  theme(axis.title = element_text(size = 12, face = "bold"),
        plot.caption = element_text(face = "italic", hjust = 0))+
  labs(#x = "", 
       #y = "", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption = stringr::str_glue("Source: Linelist data from: {data_date}
                                   {nrow(linelist %>% filter(is.na(date_onset)))} cases are missing date of onset and not shown"))

Adjusting axes

Note in the above that the grey gridlines do not align with the case boxes. This is because the gridlines show the 1st of a month, whereas the case boxes are by week. Weeks may overlap the start of months. To align them, include `labels_week = T in plot() and do not specify a scale_x_date().

TIP: While the data may be grouped by day or week, you can quickly reduce the number of labels appearing on the x-axis using scale_x_incidence() and specifying the number of axis breaks.

DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.

To enforce a x-axis label for every bin (daily, weekly, etc.), set the n_breaks = argument in plot() to the number of rows in your incidence object: n_breaks = nrow(epi_wk)

scale_x_date(date_breaks = , date_labels = ) discuss. TO DO

Color by group

To color the cases by group, provide the column to the groups = argument in the incidence() command. In the example below the cases are colored by their generation in the overall transmission chain.

Note the incidence() argument na_as_group = which if FALSE will prevent missing values (NA) from being categorized as their own group.

# Create epiweek object, with counts grouped by gender
gen_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$generation,
                               na_as_group = FALSE)   # Missing values not assigned their own group


# plot the epicurve
plot(gen_outbreak) 

Changing colors and legend

To change the legend, use ggplot() commands such as theme(legend.position = "top", legend.direction = "horizontal", legend.title = element_blank()). See the page of ggplot() tips for more details.

To change the colors:

Change the color palette to one of the default base R palettes using the argument col_pal in plot() (do not put the name of the palette in quotes).

Other palettes include TO DO add page with palette names… To DO

plot(gen_outbreak, col_pal = rainbow)

To specify colors manually, provide the name of the color or a character vector of colors to the argument color =. Note to function properly the number of colors must equal the number of groups (be aware of missing values as a group)

hosp_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$hospital,
                               na_as_group = FALSE)   # Missing values not assigned their own group
# default colors
plot(hosp_outbreak)

# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))

Filtered

To plot the epicurve of a subset of data, filter the data first and then use the subset in the incidence() command. The example below uses data filtered to show only female cases.

# filter the dataset
female_data <- filter(linelist, gender == "f")

# create incidence object using female-only data
female_outbreak <- incidence(female_data$date_onset, interval = "week")

# plot
plot(female_outbreak, color = "orange", border = "black", ylab = "Weekly case incidence")

Facets/small multiples

To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()

ggplot2

Below are tabs on using “ggplot2” package

Daily and weekly


# Daily case counts 
###################
plot_daily <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by day (1 days)
  stat_bin(binwidth = 1, position="stack") 

print(plot_daily)


# Weekly case counts 
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position="stack", fill = "brown") 

print(plot_weekly)

With aesthetic themes

# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_onset, factor = TRUE, numeric = TRUE)

# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))

# Weekly case counts 
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
  
  # X-axis 21-day labels
  scale_x_date( 
    # Sets date label breaks as every 3 weeks from Monday before the first case
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm=T), by = "3 weeks"),
    
    # axis limits determined by max/min + buffer
    limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), 
    
    # displays as date number, then abbreviated month (e.g. 12 Oct)
    date_labels = "%d-%b",   
    
    # sets origin at (0,0)
    expand = c(0,0)) +                                
  
  # Y-axis breaks every 5 cases
  scale_y_continuous(breaks = seq(0, ymax, 25),
                     limits = c(0, ymax),
                     expand = c(0, 0)) +
  
  # Theme specifications (axis, text, etc.)
  theme(# title
        plot.title = element_text(size=20, hjust= 0, face="bold"),   # title size, font, bold
        # axes
        axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
        axis.text = element_text(size=12),
        axis.title = element_text(size=14, face="bold"),
        axis.line = element_line(colour="black"),
        # background
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        # caption (italics, on right side)
        plot.caption = element_text(hjust = 0, face = "italic")
        ) +
  
  guides(fill = guide_legend(reverse = TRUE,                   # Orders Non-active zones at end of legend
                             override.aes = list(size = 0.2),
                             ncol = 2)) +                        # Number of legend columns
  
  labs(x = "Week of illness onset", 
       y = "Number of cases",
       subtitle = "subtitle here")
  
  ggtitle("Epidemic curve")
## $title
## [1] "Epidemic curve"
## 
## attr(,"class")
## [1] "labels"


# print
print(plot_weekly)

By category

Colored by a category

# Setup
########
# Two known classes (select colors from colorbrewer2.org)
colors_overall = c("#d95f02",  # 
                   "#1b9e77",
                   "#7570b3")  #    

# Order sex variable by reverse # of cases, so plot stacks with smallest # of cases at top
linelist$gender <- factor(linelist$gender, 
                         levels = levels(fct_rev(fct_infreq(linelist$gender))))

# Calculates maximum yaxis height for uniformity between the two graphs
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_onset),])



# PLOT - BY ONSET DATE
######################
plot_defined_cats <- ggplot(linelist, aes(x = date_onset, fill = gender)) + 
  
  # stacked bars, width of 7 days
  stat_bin(binwidth = 7, position = "stack") +
  
  # Colors and labels of confirmed/probable
  scale_fill_manual(values = rev(colors_overall),
                    labels = str_to_sentence(levels(factor(linelist$gender)))) +
  
  # X-axis scale labels (not aggregation, just the labels)
  scale_x_date(# Sets date label breaks as every week
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
    limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date # then abbreviated month (e.g. 12 Oct)
    expand = c(0, 0)) +                                # sets origin at (0,0)
  
  # Y-scale in breaks, up to the ymax previously defined
  scale_y_continuous(breaks = seq(0, 500, 25), limits = c(0, ymax), expand=c(0, 0)) +
  
  # Themes for axes, titles, background, etc.
  theme(plot.title       = element_text(size=20, hjust=0.5, face="bold"),
        axis.text        = element_text(size=12),
        axis.title       = element_text(size=14, face="bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line        = element_line(colour = "black"),
        axis.text.x      = element_text(angle=90, vjust=0.5, hjust=1)) +
  
  # Legend specifications
  theme(legend.title           = element_blank(),
        legend.justification   = c(0, 1), 
        legend.position        = c(0.09, 0.98),
        legend.background      = element_blank(),
        legend.text            = element_text(size = 12)) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
  
  # Axis and caption labels
  labs(x = "Week of illness onset",
       y = "Number of Cases") +
  
  # Title
  ggtitle("Cases by week of illness onset")

# print
print(plot_defined_cats)

By active area

# PARAMETERS
#############

# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))


# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_onset > (data_date - 90))])
active_zones

# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
  filter(province %in% active_zones) %>%
  group_by(province) %>%
  summarise(cases = n())
order_table

# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)

# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))

# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
  
  # Value assignments
  # Non-active zones
  linelist$active_zone == FALSE    ~ "Non-active zones",
  # All others are assigned their names, capitalized
  TRUE  ~ stringr::str_to_title(linelist$province)),
  
  # Order of variable levels
  levels = c(
    # "Non-active zones" is first level
    "Non-active zones",  
    
    # Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
    str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))  


table(linelist$graph_zone, useNA = "ifany")


# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(linelist$graph_zone, na.rm=T)) - 1 

# List of possible colors (see colorbrewer2.com, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
                "#1b9e77", # turquoise green
                "#ff7f00", # orange
                "#ffff33", # yellow
                "#6a3d9a", # purple
                "#b15928", # brown
                "#1f78b4", # blue
                "#e31a1c", # red,
                "#fb9a99", # pink
                "#b2df8a", # light green 
                "#cab2d6", # light purple
                "#a6cee3", # light blue
                "#fdbf6f", # beige
                "#33a02c"  # green
)

# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))


# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_onset, fill = graph_zone)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position = "stack") +
  
  # Fill of bars
  scale_fill_manual(values = colors_linelist, 
                    labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
  
  # X-axis 21-day labels
  scale_x_date( # Sets date label breaks as every 3 weeks from Monday before the first case
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
    limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date number, then abbreviated month (e.g. 12 Oct)
    expand = c(0,0)) +                                # sets origin at (0,0)
  
  # Y-axis breaks every 5 cases
  scale_y_continuous(breaks = seq(0, ymax, 25),
                     limits = c(0, ymax),
                     expand = c(0, 0)) +
  
  # Theme specifications (axis, text, etc.)
  theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"),   # title size, font, bold
        axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
        axis.text = element_text(size=12),
        axis.title = element_text(size=14, face="bold"),
        axis.line = element_line(colour="black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.caption = element_text(hjust = 0, face = "italic")
        ) +
  
  # Legend specifications
  theme(legend.title = element_blank(),                        # No legend title
        legend.position = c(0.20, 0.85),                         # placement of legend
        legend.background = element_blank(),                     # legend background   
        legend.text = element_text(size=12)) +                 # legend text size
  
  guides(fill = guide_legend(reverse = TRUE,                   # Orders Non-active zones at end of legend
                             override.aes = list(size = 0.2),
                             ncol = 2)) +                        # Number of legend columns
  
  labs(x = "Week of illness onset", 
       y = "Number of cases",
       subtitle = "Health zones with cases in the last 42 days specified by color",
       caption = paste0(nrow(linelist),
                        " confirmed and probable cases, reported as of ", data_date, ". ",
                        missing_onset, " cases missing date of onset and not shown.",
                        "\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
  
  ggtitle("Epidemic curve by active health zones")


# print
plot_overall

By grouped areas


#SETUP
#############
# Filter to health zone of interest
zone_data <- linelist

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))

# Assign health area groups (individual for HAs of interest, groups others together)
linelist$graph_areas <- factor(case_when(
  linelist$province == "Shanghai"    ~ "Shanghai",
  linelist$province == "Jiangsu"     ~ "Jiangsu",
  linelist$province == "Zhejiang"    ~ "Zhejiang",
  TRUE                                ~ "Other (10)"
),
# Levels part of the factor function assigns order of appearance
levels = c(                         
  "Other (10)",
  "Shanghai",
  "Jiangsu",
  "Zhejiang"
)
)

# checks
table(linelist$graph_areas, useNA = "ifany")

# Color assignments
colors_needed <- length(unique(linelist$graph_areas, na.rm=T)) - 1 # number of colors needed

# list of colors
colors_aire = c("#a6cee3", 
                           "#1f78b4",
                           "#b2df8a", 
                           "#33a02c",
                           "#fb9a99",
                           "#e31a1c",
                           "#fdbf6f",
                           "#ff7f00",
                           "#cab2d6",
                           "#6a3d9a",
                           "#ffff99",
                           "#b15928"
                           )

# Reduce number of colors to only the number needed
colors_aire <- c("#cccccc", rev(colors_aire[1:colors_needed]))


# Plot of province
#####################################
plot <- ggplot(linelist, aes(x = date_onset, fill = graph_areas)) + 
  
  stat_bin(binwidth = 7, position="stack") +
  
  scale_fill_manual(values = colors_aire, labels = str_to_sentence(levels(factor(linelist$graph_areas)))) +
  
  scale_x_date(date_breaks = "1 week", date_labels = "%d-%b", limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), expand=c(0,0)) + # I used the date onset variable here so x axes will be the same
  
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
  
  theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
        plot.caption = element_text(hjust = 0, face = "italic"),
        
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        
        
        axis.line = element_line(colour = "black"),
        axis.text = element_text(size = 12),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        axis.title = element_text(size = 14, face = "bold"),
        
        legend.title = element_blank(),
        legend.justification = c(0,1), 
        legend.position = c(0.05, 1),
        legend.background = element_blank(),
        legend.text = element_text(size = 12)) +
  
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 4)) +
  
  labs(x="Week of illness onset", 
       y="Number of cases",
       subtitle = "",
       caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset.")) +
  
  ggtitle("Cases of influenza, by province")

plot

Faceting

By “wave” in time

# Define the waves
##################
# zone_data <- filter(linelist, zone_de_sante == "mabalako")
# 
# zone_data$wave <- case_when(
#   zone_data$date_onset >= as.Date("2018-03-01") &
#     zone_data$date_onset < as.Date("2018-10-25")     ~ "Wave 1",
#   
#   zone_data$date_onset >= as.Date("2018-10-25") &
#     zone_data$date_onset < as.Date("2019-02-01")     ~ "Wave 2",
#   
#   zone_data$date_onset >= as.Date("2019-02-01") &
#     zone_data$date_onset < as.Date("2019-09-15")     ~ "Wave 3",
#   
#   zone_data$date_onset >= as.Date("2019-09-15")      ~ "Wave 4",
#   
#   TRUE ~ NA_character_
# )
# 
# table(is.na(zone_data$date_onset))
# table(zone_data$wave, useNA = "always")
# 
# 
# # Color assignments
# colors_needed <- length(unique(zone_data$wave, na.rm=T)) # number of colors needed
# 
# # list of colors
# colors_aire = c("#a6cee3", 
#                            "#1f78b4",
#                            "#b2df8a", 
#                            "#33a02c",
#                            "#fb9a99",
#                            "#e31a1c",
#                            "#fdbf6f",
#                            "#ff7f00",
#                            "#cab2d6",
#                            "#6a3d9a",
#                            "#ffff99",
#                            "#b15928"
#                            )
# 
# # Reduce number of colors to only the number needed
# colors_aire <- c(rev(colors_aire[1:colors_needed]))
# 
# 
# # Plot of health zone colored by wave
# #####################################
# plot_Mabalako <- ggplot(zone_data, aes(x = date_onset, fill = wave)) + 
#   
#   stat_bin(binwidth = 7, position = "stack") +
#   
#   scale_fill_manual(values = rev(colors_aire), labels = str_to_sentence(levels(factor(zone_data$wave)))) +
#   
#   scale_x_date(date_breaks = "21 days", date_labels = "%d-%b",
#                limits = c((min(zone_data$date_onset, na.rm = T) - 8), (max(zone_data$date_report, na.rm = T) + 8)), expand = c(0,0)) +
#   
#   scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
#   
#   theme(text = element_text(family = "Segoe Condensed"),
#         axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
#         axis.text = element_text(size = 12),
#         axis.title = element_text(size = 14, face = "bold"),
#         axis.line = element_line(colour = "black"),
#         
#         plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
#         plot.caption = element_text(hjust = 0, face = "italic"),
#         
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank(),
#         panel.background = element_blank(),
#         
#         legend.title = element_blank(),
#         legend.justification = c(0,1), 
#         legend.position = c(0.75, 0.98),
#         legend.background = element_blank(),
#         legend.text = element_text(size=12)) +
#   
#   guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 1)) +
#   
#   labs(x="Week of illness onset", 
#        y="Number of cases",
#        subtitle = "",
#        caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset and 16 excluded due to uncertain health zone of report.")) +
#   
#   ggtitle("Four waves of EVD in Mabalako health zone")
# 
# plot_Mabalako
# 
# 
# # Produce table describing each wave
# ####################################
# table <- zone_data %>%
#   select("aire_de_sante", "wave", "community_death", "date_onset", "cte_date", "epicasedef", "community_death", "contact_registered", "contact_surveilled") %>%
#   group_by(wave) %>%
#   summarise(first_onset       = min(date_onset, na.rm = T),
#             last_admission    = max(cte_date, na.rm = T),
#             n                 = n(),
#             confirmed         = sum(epicasedef == "confirmed"),
#             community_deaths  = paste0(sum(community_death    == 1), 
#                                        " (", round(100*sum(community_death == 1)/confirmed),"%)"),
#             reg_contacts      = paste0(sum(contact_registered == "yes"),
#                                        " (", round(100*sum(contact_registered == "yes")/confirmed),"%)"),
#             surv_contacts     = paste0(sum(contact_surveilled == "yes"),
#                                        " (", round(100*sum(contact_surveilled == "yes")/confirmed),"%)"),
#             top               = paste(toupper(names(sort(table(aire_de_sante),decreasing=TRUE)[1:3])), collapse=", ",
#                                       round(100*(sort(table(aire_de_sante),decreasing=TRUE)[1:3]/confirmed)), "%"),
#             health_areas      = paste(toupper(unique(aire_de_sante)), collapse=', ') 
#             )
# 
# kable(table)

Aggregated data

Situation

Often you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.

This section will utilize the counts_data dataset that was imported earlier, in the data preparation section.

Note: The incidence package does not support aggregate data

Clean dates

As before, we must ensure date variables are correctly classified.

# Convert Date variable to Date class
class(count_data$date_hospitalisation)
## [1] "Date"
count_data$date_hospitalisation <- as.Date(count_data$date_hospitalisation)

Create weeks

# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$date_hospitalisation,      # use the Date variable
                                        week_start = "Monday", # epiweek begins on Monday
                                        floor_day = TRUE,      # only display year and week #
                                        factor = TRUE)         # expand to include all possible weeks

Clean dates

ggplot(data = count_data, aes(x = as.Date(epiweek), y = n_cases, group = hospital, fill = hospital))+
     geom_bar(stat = "identity")+
     
     # LABELS for x-axis
     scale_x_date(date_breaks = "1 month",  # displays by month
                  date_labels = '%b%d\n%Y')+  #labeled by month with year below
     
     # Choose color palette (uses RColorBrewer package)
     scale_fill_brewer(palette = "Pastel1")+      
     
     # Theme specifications (axis, text, etc.)
     theme(
          # title
          plot.title = element_text(size=20, hjust= 0, face="bold"),   # title size, font, bold
          # axes
          axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
          axis.text = element_text(size=12),
          axis.title = element_text(size=14, face="bold"),
          axis.line = element_line(colour="black"),
          # background
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          # caption (italics, on right side)
          plot.caption = element_text(hjust = 0, face = "italic"))+
     
     # labels
     labs(x = "Week of report", 
          y = "Number of cases",
          subtitle = "Cases aggregated by week and shown by hospital",
          caption = "Data source: XXXXX")+
     
     ggtitle("Epidemic curve of disease X in fictional location")

Dual-axis

Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.

In ggplot it is very difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.

Second axis

This uses the linelist dataset

TODO not complete yet

library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
     mutate(onset_epiweek = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
     group_by(onset_epiweek) %>%
     summarize(num_male = sum(gender == "m"),
               num_female = sum(gender == "f"),
               pct_male = round(100*(num_male / n())),
               med_age  = median(as.numeric(age), na.rm=T)
               ) 
# remove pct and melt
linelist_week_melted <- linelist_week %>%
     select(-c("pct_male", "med_age")) %>%
     melt(id.vars = c("onset_epiweek"))

# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
                              linelist_week,
                              by = "onset_epiweek")

second_axis <- ggplot(linelist_week_melted,
                      aes(x = as.Date(onset_epiweek),
                          y = value, group = variable,
                          fill = variable)) + 
  
  # bars
  geom_bar(stat = "identity")+
  
  # Colors and labels of confirmed/probable
  scale_fill_manual(values = c("blue", "red"),
                    labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
  
  geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
     
  scale_color_manual(values = "black")+
     
  scale_y_continuous(sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20)))+


  # X-axis scale labels (not aggregation, just the labels)
  scale_x_date(# Sets date label breaks as every week
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
    limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date # then abbreviated month (e.g. 12 Oct)
    expand = c(0, 0)) +                                # sets origin at (0,0)
  
  # Y-scale in breaks, up to the ymax previously defined
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
  
  # Themes for axes, titles, background, etc.
  theme(plot.title       = element_text(size=20, hjust=0.5, face="bold"),
        axis.text        = element_text(size=12),
        axis.title       = element_text(size=14, face="bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line        = element_line(colour = "black"),
        axis.text.x      = element_text(angle=90, vjust=0.5, hjust=1)) +
  
  # Legend specifications
  theme(legend.title           = element_blank(),
        legend.justification   = c(0, 1), 
        legend.position        = c(0.09, 0.98),
        legend.background      = element_blank(),
        legend.text            = element_text(size = 12)) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
  
  # Axis and caption labels
  labs(x = "Week of illness onset",
       y = "Number of Cases",
       caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
  
  # Title
  ggtitle("Cases by week of illness onset")

second_axis

# print
print(plot_defined_cats)

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.